Abstract: This paper describes our participation in the Federated Web
Search track at TREC 2014. Our main focus is on the resource selection
task, where we employ a learning-to-rank approach to combine various
(instantiations of) resource ranking models. Further, we show that
vertical selection can be run on the output from resource selection, and
that it directly benefits from improvements thereof.

1  Introduction 

We describe our participation in the Federated Web Search track at TREC
2014. Specifically, we took part in the resource selection and vertical
selection tasks. For resource selection, our focus was on finding a way to
effectively combine two principal strategies, Collection-centric (CC) and
Document-centric (DC), that we developed in prior work (Balog, 2014). We
employ a learning-to-rank approach, where various instantiations of the CC
and DC models, using different representations and relevance cutoff values,
are used as features. We present our approach and results in Section 2. We
base our vertical selection runs on the outcomes of the resource selection
step. Specifically, we use the estimated collection relevance scores as
binary judgments, thereby essentially delegating the “selection” problem to
the resource ranking component. The method and the results are described in
Section 3.


2  Resource  selection 

In prior work, we presented two approaches to the resource selection task
based on generative language modeling techniques (Balog, 2014). According
to the Collection-centric (CC) model, each collection is represented as a
term distribution, which is estimated from all sampled documents. The
second model, Document-centric (DC), first scores individual sampled
documents, then considers the top-K ranked ones to determine collection
relevance. Despite its relative simplicity, the DC model delivers solid
performance; at TREC 2013 it came very close to the top performing runs on
all metrics (Demeester et al., 2014). We also experimented with the
combination of the CC and DC strategies in our participation last year,
using a linear mixture model, but it did not improve over the DC model.
This year our aim is to find a way to effectively combine the CC and DC
models. To this end, we employ learning-to-rank techniques.

2.1  Approach 

We use the scores estimated by the CC and DC models as features.
Specifically, we consider a number of different configurations, based on
the type of document representation (title, snippet, page) and the cutoff
value (K, only for the DC model). In the following subsections, we briefly
present the CC and DC models; for a more detailed description we refer to
Balog (2014). Additionally, we take collection size to be a feature as well
(previously, it was incorporated as a prior collection probability).
Table 1 lists our features (36 in total).
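As a quick sanity check on the feature count, the configurations listed in Table 1 can be enumerated as follows. This is only an illustrative sketch: the feature naming scheme (`CC_title`, `DC_document_K500`, and so on) is our own assumption for the example and is not taken from the original system.

```python
from itertools import product

# Values as listed in Table 1; the naming scheme below is hypothetical.
CC_REPRESENTATIONS = ["title", "snippet"]
DC_REPRESENTATIONS = ["title", "snippet", "document"]
DC_CUTOFFS = [10, 20, 50, 75, 100, 150, 200, 250, 300, 500, 1000]


def feature_names():
    """Enumerate one name per feature configuration."""
    names = [f"CC_{r}" for r in CC_REPRESENTATIONS]
    names += [f"DC_{r}_K{k}" for r, k in product(DC_REPRESENTATIONS, DC_CUTOFFS)]
    names.append("snippets")  # collection size (number of sampled snippets)
    return names


print(len(feature_names()))  # 2 CC + 3 * 11 DC + 1 size feature = 36
```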

2.1.1  Collection-centric  Model 

Drawing  on  Callan  et  al.  (1995)  and  Si  et  al.  (2002),  this 
approach  treats  each  collection  as  a  single,  large  document. 
Under  the  language  modeling  framework,  the  probability  of 
the  collection  generating  the  query  is  expressed  as  follows: 

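$$P(q|c) = \prod_{t \in q} \Big( (1-\lambda) \sum_{d \in c} P(t|d)\, P(d|c) + \lambda P(t) \Big)^{n(t,q)} \qquad (1)$$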

where n(t,q) is the number of times term t is present in the query q,
P(t|d) and P(t) are maximum-likelihood estimates of the probability of
observing term t given the document and background language models,
respectively, and λ is a smoothing parameter. The background language model
is estimated from all sampled documents. Here, all documents are assumed to
be equally important within a given collection; therefore, P(d|c) is set to
1/|c|, where |c| is the number of (sampled) documents in collection c.
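The collection-centric scoring described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: documents are assumed to be pre-tokenized lists of terms, the function names are hypothetical, the default value of the smoothing parameter is arbitrary, and the score is computed in log space for numerical stability.

```python
import math
from collections import Counter


def ml_estimate(tokens):
    """Maximum-likelihood term distribution of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


def cc_log_score(query, collection_docs, background, lam=0.1):
    """log P(q|c) for the CC model with uniform P(d|c) = 1/|c|.

    Assumes collection_docs is non-empty; lam=0.1 is an arbitrary
    illustrative default, not a value reported in the paper.
    """
    doc_models = [ml_estimate(d) for d in collection_docs]
    p_d_given_c = 1.0 / len(doc_models)
    log_score = 0.0
    for term, n_tq in Counter(query).items():
        # Collection model: average of the document models.
        p_t_c = sum(m.get(term, 0.0) * p_d_given_c for m in doc_models)
        p = (1 - lam) * p_t_c + lam * background.get(term, 0.0)
        if p == 0.0:
            return float("-inf")  # term unseen even in the background
        log_score += n_tq * math.log(p)
    return log_score


docs = [["a", "b"], ["a", "a"]]
background = ml_estimate([t for d in docs for t in d])
print(cc_log_score(["a"], docs, background, lam=0.5))
```

Collections are then ranked by this score; since the logarithm is monotonic, the ranking is the same as under the raw probability.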

2.1.2  Document-centric  Model 

Table 1: List of features used for resource selection.

Feature         Description
CC_r(q,c)       P(q|c) estimated using the CC model (Eq. 1)
                representations: r = {title, snippet}
DC_{r,K}(q,c)   P(q|c) estimated using the DC model (Eq. 2)
                representations: r = {title, snippet, document}
                cutoff values: K = {10, 20, 50, 75, 100, 150, 200,
                250, 300, 500, 1000}
snippets(c)     Number of snippets in the sample of c

Instead of creating a direct term-based representation of collections, we
model and query individual (sampled) documents, then aggregate their
relevance estimates. This approach closely resembles the ReDDE collection
selection algorithm (Si and Callan, 2003). Formally:

$$P(q|c) = \sum_{d \in c} P(d|c) \prod_{t \in q} \big( (1-\lambda) P(t|d) + \lambda P(t) \big)^{n(t,q)} \qquad (2)$$

where, as before, P(t|d) and P(t) are the document and background term
probabilities, λ is the smoothing parameter, and P(d|c) is the importance
of the document given the collection. Additionally, we apply a rank-based
cutoff and consider only the top K most relevant documents in the sample
index for the computation of Eq. 2.
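The document-centric model with its rank-based cutoff can be sketched as follows. This is a hedged illustration rather than the original code: the sample index is assumed to be a list of (collection id, token list) pairs, P(d|c) is assumed to be uniform (1/|c|) as in the CC model, all function and variable names are hypothetical, and the default for the smoothing parameter is arbitrary.

```python
from collections import Counter, defaultdict


def doc_likelihood(query, doc_tokens, background, lam=0.1):
    """Smoothed query likelihood of a single sampled document."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    score = 1.0
    for term, n_tq in Counter(query).items():
        p_t_d = counts[term] / total if total else 0.0
        score *= ((1 - lam) * p_t_d + lam * background.get(term, 0.0)) ** n_tq
    return score


def dc_scores(query, sample_index, background, k=500, lam=0.1):
    """P(q|c) per collection, using only the top-k ranked sampled documents.

    sample_index: list of (collection_id, doc_tokens) pairs.
    P(d|c) = 1/|c| is an assumption made for this sketch.
    """
    sizes = Counter(c for c, _ in sample_index)
    scored = [(doc_likelihood(query, toks, background, lam), c)
              for c, toks in sample_index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    scores = defaultdict(float)
    for likelihood, c in scored[:k]:
        scores[c] += likelihood / sizes[c]
    return dict(scores)


index = [("c1", ["a", "a"]), ("c1", ["b", "b"]), ("c2", ["a", "b"])]
background = {"a": 0.5, "b": 0.5}
print(dc_scores(["a"], index, background, k=2, lam=0.5))
```

Collections with no document among the top k simply receive no score. For long queries the raw product may underflow, in which case per-document log-likelihoods would be a safer intermediate representation.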

2.1.3  Combining  Models 

We employ a listwise learning-to-rank approach, LambdaMART (Wu et al.,
2010). For training the machine learning model we use data from prior
editions of the TREC FedWeb track. Our results in §2.2 indicate that the
choice of the training material has a major impact on performance.

2.2  Runs  and  results 

We  submitted  the  following  runs: 

NTNUiSrs1  Document-centric model using the entire document text
(r = document) and a cutoff value of K = 500. This particular setting was
chosen based on a (non-extensive) set of experiments performed on the
FedWeb’13 collection.

NTNUiSrs2  Learning-to-rank approach trained on the FedWeb’13 data set.

NTNUiSrs3  Learning-to-rank approach trained on the FedWeb’12 and ’13
data sets.

Table 2 presents the results. We find that the learning-to-rank approach
trained on FedWeb’13 outperforms the DC model by over 13% in terms of the
official metric, nDCG@20 (NTNUiSrs2 vs. NTNUiSrs1). Interestingly, when
training was done on both FedWeb’12 and ’13, performance dropped
substantially (NTNUiSrs3 vs. NTNUiSrs1). Discriminative learning is indeed
a promising direction for this task, but further research is needed to
understand how the training material should be composed. It is also left to
future work to experiment with different learning-to-rank algorithms,
specifically pointwise and pairwise approaches.

Table 2: Results for our official resource selection runs. Best scores for
each metric are marked with an asterisk.

Run         nDCG@20   nDCG@10   P@1      P@5
NTNUiSrs1   0.306     0.225     0.148    0.195
NTNUiSrs2   0.348*    0.281*    0.206*   0.257*
NTNUiSrs3   0.248     0.205     0.202    0.189
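The relative differences quoted above follow directly from the nDCG@20 column of Table 2. The short check below reproduces the arithmetic; the helper name `relative_change` is ours and only serves this illustration.

```python
# nDCG@20 values copied from Table 2.
NDCG_AT_20 = {"NTNUiSrs1": 0.306, "NTNUiSrs2": 0.348, "NTNUiSrs3": 0.248}


def relative_change(run, baseline="NTNUiSrs1", scores=NDCG_AT_20):
    """Relative change (in percent) of a run over the baseline run."""
    return (scores[run] - scores[baseline]) / scores[baseline] * 100.0


for run in ("NTNUiSrs2", "NTNUiSrs3"):
    print(f"{run}: {relative_change(run):+.1f}%")
# NTNUiSrs2: +13.7%
# NTNUiSrs3: -19.0%
```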

3  Vertical  selection 

3.1  Approach 

Our choice of method for the vertical selection task is closely tied to our
resource selection approach. We assume that resource selection produces a
relevance score s(q,c) for each collection such that

$$s(q,c) \begin{cases} > 0, & c \text{ is relevant} \\ \leq 0, & c \text{ is nonrelevant} \end{cases} \qquad (3)$$

Then, we simply select all collections that have a positive relevance
score:

$$V(q) = \{ c \mid s(q,c) > 0 \}, \qquad (4)$$

where V(q) denotes the set of selected verticals for query q. In a way, we
delegate the “selection” problem to the resource ranking component.
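The selection rule of Eq. 4 amounts to a one-line filter over the scores produced by resource selection. The sketch below also includes a per-query precision, recall, and F1 helper of the kind used to evaluate this task. Both functions are illustrative assumptions: the names, the `threshold` parameter (which defaults to the value 0 used in Eq. 4), the example collection identifiers, and the handling of empty sets are ours, and the official evaluation may treat edge cases differently.

```python
def select_verticals(scores, threshold=0.0):
    """Eq. 4: keep every collection whose score exceeds the threshold.

    scores: dict mapping collection id -> s(q, c) for a single query.
    """
    return {c for c, s in scores.items() if s > threshold}


def precision_recall_f1(selected, relevant):
    """Set-based precision, recall, and F1 for a single query.

    Empty sets yield 0.0 here; this convention is an assumption.
    """
    tp = len(selected & relevant)
    p = tp / len(selected) if selected else 0.0
    r = tp / len(relevant) if relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


# Hypothetical scores for one query.
scores = {"news": 1.2, "video": -0.3, "blogs": 0.0, "shopping": 0.4}
selected = select_verticals(scores)
print(sorted(selected))  # ['news', 'shopping']
print(precision_recall_f1(selected, {"news", "blogs"}))  # (0.5, 0.5, 0.5)
```

Exposing the threshold as a parameter keeps the rule of Eq. 4 as the default while allowing other cutoff values to be tried without changing the code.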

3.2  Runs  and  results 

We  submitted  the  following  runs: 

NTNUiSvs2  Based  on  resource  selection  run  NTNUiSrs2. 
NTNUiSvs3  Based  on  resource  selection  run  NTNUiSrs3. 

Table 3 displays precision (P), recall (R), and F1-measure (F1) for our
submitted runs. Based on these results, we make the unsurprising
observation that better resource selection indeed leads to better vertical
selection. The scores, however, are quite low in absolute terms, which
suggests that the scores produced by the resource selection approach may
not satisfy the criteria that we have specified regarding the signs of
collection scores (cf. Eq. 3). We hypothesize that using simple score-based
thresholding (i.e., replacing the value 0 in Eq. 3 with a parameter) might
alleviate this issue. It might also be the case that the underlying
resource selection step needs to be cast as a classification task as
opposed to a ranking problem.


Table 3: Results for our official vertical selection runs. Best scores for
each metric are marked with an asterisk.

Run         P        R        F1
NTNUiSvs2   0.157*   0.406*   0.205*
NTNUiSvs3   0.145    0.281    0.177

4  Conclusions 

We described our participation in the TREC 2014 Federated Web Search track.
For resource selection we have experimented with a discriminative learning
approach for combining numerous instantiations of resource selection
models. We have shown that it can outperform a competitive baseline model,
but is sensitive to the choice of the underlying training material. We have
used the estimated collection relevance scores, as binary judgments, to
make a selection of verticals. We have found that improvements in resource
selection indeed translate to better vertical selection performance. At the
same time, making a binary judgment about the relevance of a collection
remains challenging, given that resource selection is approached as a
ranking problem, and not as a classification task.